Architecture Overview
The pipeline is orchestrated by run_full_pipeline.py, which executes scripts sequentially across six phases (plus an optional Phase 2.5):
Phase Breakdown
PHASE 1: Core Data (Foundation)
Creates the foundational datasets that all other scripts depend on.
Scripts & Outputs
| Script | Output | Purpose |
|---|---|---|
| fetch_dhan_data.py | dhan_data_response.json, master_isin_map.json | Fetches 2,775 stocks and creates ISIN mapping |
| fetch_fundamental_data.py | fundamental_data.json | Quarterly results & financial ratios (35 MB) |
| NSE CSV Download | nse_equity_list.csv | Listing dates for all stocks |
Critical Dependency:
master_isin_map.json is used by ALL scripts in Phase 2, 2.5, and 4. If fetch_dhan_data.py fails, the pipeline cannot continue.
PHASE 2: Data Enrichment (Fetching)
Parallel execution of 11 data fetching scripts, all consuming master_isin_map.json.
Scripts & Outputs
| Script | Output | Description |
|---|---|---|
| fetch_company_filings.py | company_filings/{SYMBOL}_filings.json | Hybrid LODR + Legacy filings |
| fetch_new_announcements.py | all_company_announcements.json | Live corporate announcements |
| fetch_advanced_indicators.py | advanced_indicator_data.json | Pivot Points, EMA/SMA signals (8.3 MB) |
| fetch_market_news.py | market_news/{SYMBOL}_news.json | AI-sentiment news (50/stock) |
| fetch_corporate_actions.py | upcoming_corporate_actions.json, history_corporate_actions.json | Dividends, Bonus, Splits (2 years history + 2 months ahead) |
| fetch_surveillance_lists.py | nse_asm_list.json, nse_gsm_list.json | ASM/GSM surveillance lists |
| fetch_circuit_stocks.py | upper_circuit_stocks.json, lower_circuit_stocks.json | Circuit breaker stocks |
| fetch_bulk_block_deals.py | bulk_block_deals.json | Bulk/Block deals (30 days) |
| fetch_incremental_price_bands.py | incremental_price_bands.json | Daily price band changes |
| fetch_complete_price_bands.py | complete_price_bands.json | All securities price bands |
| fetch_all_indices.py | all_indices_list.json | 194 market indices |
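One way the parallel execution of these fetchers could be arranged is sketched below. This is only an illustration, not the code in run_full_pipeline.py: the helper names `run_script` and `run_parallel`, the thread pool, and the worker count are all assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
import subprocess
import sys


def run_script(script):
    """Run one fetch script as a child process; True means exit code 0."""
    result = subprocess.run([sys.executable, script], check=False)
    return result.returncode == 0


def run_parallel(scripts, runner=run_script, max_workers=4):
    """Run every script concurrently and report success per script.

    `runner` and `max_workers` are hypothetical parameters for this sketch.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as `scripts`.
        outcomes = list(pool.map(runner, scripts))
    return dict(zip(scripts, outcomes))
```

Threads are sufficient here because each worker only waits on a child process. Passing a different `runner` makes the scheduling logic easy to exercise without touching the network.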
PHASE 2.5: OHLCV Data (Smart Incremental)
Optional phase controlled by the FETCH_OHLCV flag. Downloads lifetime historical OHLCV data with intelligent incremental updates.
Scripts & Performance
| Script | Output | Performance |
|---|---|---|
| fetch_all_ohlcv.py | ohlcv_data/{SYMBOL}.csv | ~2-5 min incremental, ~30 min first-time |
| fetch_indices_ohlcv.py | ohlcv_data/indices/{INDEX}.csv | High-speed specialized fetcher |
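The incremental behaviour can be pictured as "look at the last row already on disk and only request what comes after it". The sketch below shows that idea under stated assumptions: a `Date` column in ISO format (`YYYY-MM-DD`), a hypothetical helper name `next_fetch_start`, and an arbitrary fallback start date for first-time runs. The real CSV layout used by fetch_all_ohlcv.py may differ.

```python
import csv
from datetime import date, timedelta
from pathlib import Path


def next_fetch_start(csv_path, first_run_start=date(2000, 1, 1)):
    """Return the first date that still needs to be downloaded."""
    path = Path(csv_path)
    if not path.exists():
        return first_run_start  # no file yet: full history download
    with path.open(newline="") as handle:
        # Assumed column name "Date"; rows without it are skipped.
        dates = [row["Date"] for row in csv.DictReader(handle) if row.get("Date")]
    if not dates:
        return first_run_start
    last = max(date.fromisoformat(d) for d in dates)
    return last + timedelta(days=1)
```

A first-time run falls back to the full history, which is consistent with the much longer first-run time in the table above.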
PHASE 3: Base Analysis (Building Master JSON)
Single critical script that produces the base structure of all_stocks_fundamental_analysis.json.
bulk_market_analyzer.py Details
Inputs:
- fundamental_data.json (Phase 1)
- dhan_data_response.json (Phase 1)
- advanced_indicator_data.json (Phase 2)
- nse_equity_list.csv (Phase 1)
Output:
- all_stocks_fundamental_analysis.json (base structure with ~60 fields)
Processing:
- Loads fundamental data for all 2,775 stocks
- Merges technical data from Dhan response
- Adds advanced indicators (Pivots, SMA/EMA status)
- Calculates QoQ/YoY growth metrics
- Computes valuation ratios (P/E, PEG, ROE, ROCE, D/E)
- Adds shareholding patterns (FII/DII changes, Free Float)
This script MUST complete successfully before Phase 4. All Phase 4 scripts modify this JSON file in-place.
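To make the QoQ/YoY growth step concrete, here is a minimal sketch of how such percentages are commonly computed from a quarterly series. The function names, the output keys, and the choice to return `None` when a comparison is undefined are assumptions, not the behaviour of bulk_market_analyzer.py.

```python
def pct_change(current, previous):
    """Percentage change from previous to current, or None if undefined."""
    if previous is None or current is None or previous == 0:
        return None
    return (current - previous) / abs(previous) * 100.0


def growth_metrics(quarterly_values):
    """Compute QoQ and YoY growth for the latest quarter.

    `quarterly_values` is ordered oldest to newest, one entry per quarter.
    """
    if not quarterly_values:
        return {"qoq_pct": None, "yoy_pct": None}
    latest = quarterly_values[-1]
    # QoQ compares with the previous quarter, YoY with the quarter four back.
    qoq = pct_change(latest, quarterly_values[-2]) if len(quarterly_values) >= 2 else None
    yoy = pct_change(latest, quarterly_values[-5]) if len(quarterly_values) >= 5 else None
    return {"qoq_pct": qoq, "yoy_pct": yoy}
```

Dividing by the absolute value of the previous figure keeps the sign meaningful when the earlier quarter was a loss.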
PHASE 4: Enrichment Injection (Order Matters!)
Six scripts that sequentially inject additional fields into all_stocks_fundamental_analysis.json.
Order & Dependencies
| Order | Script | Fields Added | Dependencies |
|---|---|---|---|
| 1 | advanced_metrics_processor.py | ADR, RVOL, ATH, Turnover, Gap Up %, Day Range % | ohlcv_data/ |
| 2 | process_earnings_performance.py | Quarterly Results Date, Returns since Earnings, Max Returns since Earnings | company_filings/, ohlcv_data/ |
| 3 | enrich_fno_data.py | F&O Flag, Lot Size, Next Expiry | F&O data fetchers |
| 4 | process_market_breadth.py | Relative Strength Rating, Market Breadth metrics | Returns data from base analysis |
| 5 | process_historical_market_breadth.py | Historical breadth charts | OHLCV data |
| 6 | add_corporate_events.py | Event Markers, Recent Announcements, News Feed | ALL Phase 2 outputs |
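The in-place injection pattern these scripts share can be sketched as "load the master JSON, merge new fields per stock, write it back". The example assumes the master file is a JSON list of records keyed by a `Symbol` field; that layout, the helper name `inject_fields`, and the temporary-file step are assumptions for illustration.

```python
import json
from pathlib import Path


def inject_fields(master_path, new_fields_by_symbol, symbol_key="Symbol"):
    """Merge extra fields into each matching stock record, then rewrite the file."""
    path = Path(master_path)
    records = json.loads(path.read_text(encoding="utf-8"))
    updated = 0
    for record in records:
        extra = new_fields_by_symbol.get(record.get(symbol_key))
        if extra:
            record.update(extra)
            updated += 1
    # Write to a temporary file first so a crash cannot leave a half-written master JSON.
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(json.dumps(records), encoding="utf-8")
    tmp.replace(path)
    return updated
```

Because every script rewrites the same file, a later script can read fields that an earlier one added, which is why the order in the table matters.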
PHASE 5: Compression
Compresses final outputs to .json.gz format with maximum compression.
Compression Details
Files Compressed:
- all_stocks_fundamental_analysis.json → .json.gz (~80% smaller)
- sector_analytics.json → .json.gz
- market_breadth.csv → .json.gz
Approximate sizes:
- Raw JSON: ~35-40 MB
- Compressed: ~7-8 MB
- Compression ratio: 80%+
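"Maximum compression" with gzip normally means compression level 9. A minimal sketch using Python's standard `gzip` module follows; the helper name `compress_file` is hypothetical, and the pipeline's own compression code may be organised differently.

```python
import gzip
import shutil
from pathlib import Path


def compress_file(src, compresslevel=9):
    """Write <src>.gz next to src at gzip level 9 (the maximum) and return its path."""
    src = Path(src)
    dst = src.with_name(src.name + ".gz")
    with src.open("rb") as fin, gzip.open(dst, "wb", compresslevel=compresslevel) as fout:
        # Stream in chunks so large JSON files are never fully loaded into memory.
        shutil.copyfileobj(fin, fout)
    return dst
```

Consumers can read the result back with `gzip.open(path, "rt")` and parse it as ordinary JSON.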
PHASE 6: Optional Standalone Data
Controlled by the FETCH_OPTIONAL flag. Produces standalone datasets not included in the master JSON.
Optional Scripts
| Script | Output | Description |
|---|---|---|
| fetch_all_indices.py | all_indices_list.json | 194 market indices |
| fetch_etf_data.py | etf_data_response.json | 361 ETF details |
Configuration Flags
Edit these flags inside run_full_pipeline.py (lines 60-71):
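As a rough illustration, the three flags named on this page might be declared as plain module-level booleans. The values below are placeholders rather than the real defaults, and `planned_phases` is a hypothetical helper that only shows which phases each flag switches on.

```python
# Illustrative values only; the real defaults live in run_full_pipeline.py.
FETCH_OHLCV = True            # Phase 2.5: download or update OHLCV history
FETCH_OPTIONAL = False        # Phase 6: standalone indices and ETF datasets
CLEANUP_INTERMEDIATE = False  # keep only .json.gz outputs and ohlcv_data/


def planned_phases(fetch_ohlcv, fetch_optional):
    """List the phases that would run for a given flag combination."""
    phases = ["1", "2"]
    if fetch_ohlcv:
        phases.append("2.5")
    phases += ["3", "4", "5"]
    if fetch_optional:
        phases.append("6")
    return phases
```

For example, `planned_phases(FETCH_OHLCV, FETCH_OPTIONAL)` with the placeholder values above skips only Phase 6.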
Impact of Configuration
FETCH_OHLCV
| Setting | Impact | Runtime | Output Fields |
|---|---|---|---|
| True | Full OHLCV download + incremental updates | +2-30 min | All 86 fields populated |
| False | Skip OHLCV entirely | Faster (~4 min total) | 15+ fields will be zero |
Fields left at zero when FETCH_OHLCV = False:
- ADR (5/14/20/30 Days MA)
- RVOL
- ATH, % from ATH
- Gap Up %, Day Range %
- % from 52W Low
- 6 Month Returns
- 200 Days EMA Volume
- Daily Rupee Turnover (20/50/100)
- 30 Days Average Rupee Volume
- Returns since Earnings
- Max Returns since Earnings
Error Handling Strategy
The pipeline implements a resilient continuation strategy:
Critical Failures
If fetch_dhan_data.py (Phase 1) or bulk_market_analyzer.py (Phase 3) fails, the pipeline stops immediately. These scripts produce the master ISIN map and base JSON that all other scripts depend on.
Enrichment Failures
If any Phase 2 or Phase 4 script fails, the pipeline continues and marks the script as failed. This ensures the pipeline still produces its final output even if individual data sources are temporarily unavailable.
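The two failure classes can be sketched as a single loop that aborts on critical scripts and records everything else. The names `CRITICAL_SCRIPTS`, `PipelineAborted`, and `run_with_policy` are hypothetical, and `runner` stands in for whatever actually executes a script. This is a sketch of the policy described above, not the orchestrator's real code.

```python
CRITICAL_SCRIPTS = {"fetch_dhan_data.py", "bulk_market_analyzer.py"}


class PipelineAborted(RuntimeError):
    """Raised when a script the rest of the pipeline depends on fails."""


def run_with_policy(scripts, runner, critical=CRITICAL_SCRIPTS):
    """Run scripts in order; stop on critical failures, record the others."""
    failed = []
    for script in scripts:
        try:
            ok = runner(script)
        except Exception:
            ok = False  # treat an unexpected exception like a failed run
        if ok:
            continue
        if script in critical:
            raise PipelineAborted(f"{script} failed; cannot continue")
        failed.append(script)  # enrichment failure: note it and move on
    return failed
```

The returned list gives a place to report which enrichment scripts were marked as failed once the run completes.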
Performance Characteristics
Minimal Run
- Configuration: FETCH_OHLCV = False
- Runtime: ~4 minutes
- Output: 60+ fields per stock (missing volume/volatility metrics)
Full Run
- Configuration: FETCH_OHLCV = True (incremental)
- Runtime: ~6-9 minutes
- Output: All 86 fields per stock
First-Time Full Run
- Configuration: FETCH_OHLCV = True (no existing data)
- Runtime: ~35-40 minutes
- Output: All 86 fields + complete OHLCV history
With Cleanup
- Configuration: CLEANUP_INTERMEDIATE = True
- Disk Saved: ~150-200 MB
- Retained: Only .json.gz + ohlcv_data/
Next Steps
Data Flow
Understand how data transforms across phases
Output Schema
Explore the 86 fields in the final JSON
Quick Start
Run your first pipeline
Configuration
Customize pipeline behavior